Accessing Databases via Web APIs

What is an API?

  • API stands for Application Programming Interface

  • Broadly defined: a set of rules and procedures that facilitate interactions between computers and their applications

  • A very common type of API is the Web API, which (among other things) allows users to query a remote database over the internet

  • Web APIs take on a variety of formats, but the vast majority adhere to a particular style known as Reperesentational State Transfer or REST

  • What makes these “RESTful” APIs so convenient is that we can use them to query databases using URLs

RESTful Web APIs are All Around You…

Consider a simple Google search:

Ever wonder what all that extra stuff in the address bar was all about? In this case, the full address is Google’s way of sending a query to its databases asking requesting information related to the search term “golden state warriors”.

In fact, it looks like Google makes its query by taking the search terms, separating each of them with a “+”, and appending them to the link “https://www.google.com/#q=”. Therefore, we should be able to actually change our Google search by adding some terms to the URL and following the general format…

Learning how to use RESTful APIs is all about learning how to format these URLs so that you can get the response you want.

Some Basic Terminology

  • Uniform Resource Location (URL): a string of characters that, when interpreted via the Hypertext Transfer Protocol (HTTP), points to a data resource, notably files written in Hypertext Markup Language (HTML) or a subset of a database. This is often referred to as a “call”.

  • HTTP Methods/Verbs:

    • GET: requests a representation of a data resource corresponding to a particular URL. The process of executing the GET method is often referred to as a “GET request” and is the main method used for querying RESTful databases.

    • HEAD, POST, PUT, DELETE: other common methods, though mostly never used for database querying.

How Do GET Requests Work? A Web Browsing Example

As you might suspect from the example above, surfing the web is basically equivalent to sending a bunch of GET requests to different servers and asking for different files written in HTML.

Suppose, for instance, I wanted to look something up on Wikipedia. My first step would be to open my web browser and type in http://www.wikipedia.org. Once I hit return, I’d see the page below.

Several different processes occured, however, between me hitting “return” and the page finally being rendered. In order:

  1. The web browser took the entered character string and used the command-line tool “Curl” to write a properly formatted HTTP GET request and submitted it to the server that hosts the Wikipedia homepage.

  2. After receiving this request, the server sent an HTTP response, from which Curl extracted the HTML code for the page (partially shown below).

  3. The raw HTML code was parsed and then executed by the web browser, rendering the page as seen in the window.

[1] "<!DOCTYPE html>\n<html lang=\"mul\" dir=\"ltr\">\n<head>\n<!-- Sysops: Please do not edit the main template directly; update /temp and synchronise. -->\n<meta charset=\"utf-8\">\n<title>Wikipedia</title>\n<!--[if lt IE 7]><meta http-equiv=\"imagetoolbar\" content=\"no\"><![endif]-->\n<meta name=\"viewport\" content=\"initial-scale=1.0, user-scalable=yes\">\n<link rel=\"apple-touch-icon\" href=\"/static/apple-touch/wikipedia.png\">\n<link rel=\"shortcut icon\" href=\"/static/favicon/wikipedia.ico\">\n<link rel=\"license\" href=\"//creativecommons.org/licenses/by-sa/3.0/\">\n<link rel=\"stylesheet\" href=\"//meta.wikimedia.org/w/load.php?debug=false&amp;lang=en&amp;modules=ext.gadget.wm-portal&amp;only=styles&amp;skin=vector&amp;*\">\n<style type=\"text/css\">\n.central-featured-logo{background:url(//upload.wikimedia.org/wikipedia/meta/0/08/Wikipedia-logo-v2_1x.png) center center no-repeat;left:0;top:0;width:100%;height:100%}.bookshelf{display:block}.bookshelf-container{font-size:13px;font-weight:700;line-height:20px}.bookshelf-con"

Web Browsing as a Template for RESTful Database Querying

The process of web browsing described above is a close analogue for the process of database querying via RESTful APIs, with only a few adjustments:

  1. While the Curl tool will still be used to send HTML GET requests to the servers hosting our databases of interest, the character string that we supply to Curl must be constructed so that the resulting request can be interpreted and succesfully acted upon by the server. In particular, it is likely that the character string must encode search terms and/or filtering parameters, as well as one or more authentication codes. While the terms are often similar across APIs, most are API-specific.

  2. Unlike with web browsing, the content of the server’s response that is extracted by Curl is unlikely to be HTML code. Rather, it will likely be raw text response that can be parsed into one of a few file formats commonly used for data storage. The usual suspects include .csv, .xml, and .json files.

  3. Whereas the web browser capably parsed and executed the HTML code, one or more facilities in R, Python, or other programming languages will be necessary for parsing the server response and converting it into a format for local storage (e.g. matrices, dataframes, databases, lists, etc.).

STEP 1: Finding Data Resources

To determine the popularity of something, we need a measurement of how frequently or widely it is referenced or encountered. Moreover, to determine how this popularity changes over time, we need a measurement that is taken repeatedly.

Newspapers are an excellent source of such information. The frequency with which certain items appear in its pages can be a decent metric of its popularity, and its continual publication creates a built-in time series. And while there are a variety of newspapers to choose from, we’ll be working with the New York Times for a variety of reasons — including its status as a paper of record, its long publishing history, and (most importantly) its convenient article API.

NYT Article API

STEP 2: Getting API Access

For most APIs, a key or other user credentials are required for any database querying. Generally, this requires that you register with the organization. Most APIs are set up for developers, so you’ll likely be asked to register an “application”. All this really entails is coming up with a name for your app/bot/project, and providing your real name, organization, and email. Note that some more popular APIs (e.g. Twitter, Facebook) will require additional information, such as a web address or mobile number.

Once you’ve successfully registered, you will be assigned one or more keys, tokens, or other credentials that must be supplied to the server as part of any API call you make. To make sure that users aren’t abusing their data access privileges (e.g. by making many rapid queries), each set of keys will be given several rate limits governing the total number of calls that can be made over certain intervals of time. For the NYT Article API, we have relatively generous rate limits — 10 calls per second and 10,000 calls per day.

NYT Article API Keys

STEP 3: Learning how to Construct API GET Requests

Likely the most challenging part of using web APIs is learning how to format your GET request URLs. While there are common architectures for such URLs, each API has its own unique quirks. For this reason, carefully reviewing the API documentation is critical.

Most GET request URLs for API querying have three or four components:

  1. Base URL: a link stub that will be at the beginning of all calls to a given API; points the server to the location of an entire database

  2. Search Parameters: a character string appended to a base URL that tells the server what to extract from the database; basically a series of filters used to point to specific parts of a database

  3. Authenication Key/Token: a user-specific character string appended to a base URL telling the server who is making the query; allows servers to efficiently manage database access

  4. Response Format: a character string indicating how the response should be formatted; usually one of .csv, .json, or .xml

Fortunately, the NYT Article API is very well documented!

STEP 4: Constructing API GET Requests in R

Because using Web APIs in R will involve repeatedly constructing different GET requests with slightly different components each time, it is helpful to store many of the individuals components as objects and combine them using paste() when ready to send the request.

In the first place, we know that every call will require us to provide a) a base URL for the API, b) some authorization code or key, and c) a format for the response.

# Create objects holding the key, base url, and response format
key<-"ef9055ba947dd842effe0ecf5e338af9:15:72340235"
base.url<-"http://api.nytimes.com/svc/search/v2/articlesearch"
response.format<-".json"

Secondly, we need to specify our search terms, along with any filters to be placed on the results. In this case, we are searching for the phrase “jazz is dead”, though we specifically want it to appear in the body of the text.

# Specify a main search term (q)
search.term<-"jazz is dead"

# Specify and encode filters (fc)
filter.query<-"body:\"jazz is dead\"" 

Note that it can often be tricky to properly re-format character strings stored in R objects to character strings suitable for GET requests. For example, the filter above uses quotation marks to specify that we wanted to retrieve the phrase exactly. But to include those quotation marks inside a character string that — following R syntax — must itself be surrounded by double quotation marks, these original characters need to be escaped with a backslash. This results in the stored R string appearing to be different from the parsed R string.

# NOTE: double quotes within double quotes must be escaped with / so R can parse the character string
print(filter.query) # How R stores the string
[1] "body:\"jazz is dead\""
cat(filter.query) # How R parses the string
body:"jazz is dead"

To overcome some of these encoding issues, it is often helpful to URL encode our strings. URL encoding basically translates punctuation marks, white space, and other non alphanumeric characters into a series of unique characters only recognizeable by URL decoders. If you’ve ever seen %20 in a URL, this is actually a placeholder for a single space. R provides helpful functions to doing this translation automatically.

# URL-encode the search and its filters
search.term<-URLencode(URL = search.term, reserved = TRUE)
filter.query<-URLencode(URL = filter.query, reserved = TRUE)
print(search.term)
[1] "jazz%20is%20dead"
print(filter.query)
[1] "body%3a%22jazz%20is%20dead%22"

Once all the pieces of our GET request are in place, we can use either the paste() or paste0() to combine a number of different character strings into a single character string. This final string will be our URL for the GET request.

# Paste components together to create URL for get request
get.request<-paste0(base.url, response.format, "?", "q=", search.term, "&fq=", filter.query, "&api-key=", key)
print(get.request)
[1] "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=jazz%20is%20dead&fq=body%3a%22jazz%20is%20dead%22&api-key=ef9055ba947dd842effe0ecf5e338af9:15:72340235"

Once we have the URL complete, we can send a properly formated GET request. There are several packages that can do this, but httr provides a good balance of simplicity and reliability. The main function of interest here is GET():

# Send the GET request using httr package
response<-httr::GET(url = get.request)
print(response)
Response [http://api.nytimes.com/svc/search/v2/articlesearch.json?q=jazz%20is%20dead&fq=body%3a%22jazz%20is%20dead%22&api-key=ef9055ba947dd842effe0ecf5e338af9:15:72340235]
  Date: 2015-06-29 13:56
  Status: 200
  Content-Type: application/json; charset=UTF-8
  Size: 15.1 kB

The content() function allows us to extract the html response in a format of our choosing (raw text, in this case):

# Inspect the content of the response, parsing the result as text
response<-httr::content(x = response, as = "text")
substr(x = response, start = 1, stop = 1000)
[1] "{\"response\":{\"meta\":{\"hits\":27,\"time\":32,\"offset\":0},\"docs\":[{\"web_url\":\"http:\\/\\/www.nytimes.com\\/2003\\/03\\/05\\/arts\\/garcia-s-shadow-but-his-own-sound-jimmy-herring-dead-s-new-member-helps.html\",\"snippet\":\"Let's say you are the replacement for Jerry Garcia in the Grateful Dead. Your new job is to go in front of 20,000 fans and play music that they probably know better than you do. Should you play like Him? Ought you dare try?    These were among the...\",\"lead_paragraph\":\"Let's say you are the replacement for Jerry Garcia in the Grateful Dead. Your new job is to go in front of 20,000 fans and play music that they probably know better than you do. Should you play like Him? Ought you dare try? These were among the questions facing Jimmy Herring, an affable and studious 41-year-old guitarist steeped in Southern rock and schooled in jazz-fusion but basically a newcomer to the Dead until a brief, unsatisfying tenure five years ago with a band that played Dead songs. Last year the surviving"

The final step in the process involves converting the results from JSON format to something easier to work with – notably a data.frame. The jsonlite package provides several easy conversion functions for moving between JSON and vectors, data.frames, and lists.

# Convert JSON response to a dataframe
response.df<-jsonlite::fromJSON(txt = response, simplifyDataFrame = TRUE, flatten = TRUE)

# Inspect the dataframe
str(response.df, max.level = 3)
List of 3
 $ response :List of 2
  ..$ meta:List of 3
  .. ..$ hits  : int 27
  .. ..$ time  : int 32
  .. ..$ offset: int 0
  ..$ docs:'data.frame':    10 obs. of  21 variables:
  .. ..$ web_url         : chr [1:10] "http://www.nytimes.com/2003/03/05/arts/garcia-s-shadow-but-his-own-sound-jimmy-herring-dead-s-new-member-helps.html" "http://query.nytimes.com/gst/abstract.html?res=9B04E7D81131E53ABC4A51DFB667838C679EDE" "http://www.nytimes.com/1998/05/31/arts/pop-jazz-a-welcome-truce-for-jazz-festivals.html" "http://www.nytimes.com/2001/01/07/arts/music-watching-jazz-for-its-high-notes-and-low.html" ...
  .. ..$ snippet         : chr [1:10] "Let's say you are the replacement for Jerry Garcia in the Grateful Dead. Your new job is to go in front of 20,000 fans and play"| __truncated__ "NOBODY is quite sure any more just what jazz is.  There are too many hybrids by now. And with the beginning of abstraction 10 y"| __truncated__ "EVER since the Newport Jazz Festival moved to New York in 1972, the idea of a jazz festival in the city has been a sore spot fo"| __truncated__ "For the filmmaker Ken Burns, the history of jazz and American history are inextricable -- the nation's triumphs and defeats, jo"| __truncated__ ...
  .. ..$ lead_paragraph  : chr [1:10] "Let's say you are the replacement for Jerry Garcia in the Grateful Dead. Your new job is to go in front of 20,000 fans and play"| __truncated__ "NOBODY is quite sure any more just what jazz is. There are too many hybrids by now. And with the beginning of abstraction 10 ye"| __truncated__ "EVER since the Newport Jazz Festival moved to New York in 1972, the idea of a jazz festival in the city has been a sore spot fo"| __truncated__ "For the filmmaker Ken Burns, the history of jazz and American history are inextricable -- the nation's triumphs and defeats, jo"| __truncated__ ...
  .. ..$ abstract        : chr [1:10] "Interview with guitarist Jimmy Herring, who now plays with Grateful Dead, trying to fill void left by late Jerry Garcia; photos"| __truncated__ "McRae, B: The Jazz Cataclysm" "Peter Watrous welcomes move by Michael Dorf to open upstart Texaco Jazz Festival several weeks earlier than ususual so that it "| __truncated__ "Representation of jazz in Ken Burns's documentary series Jazz is assessed by musicians and others in jazz world: Martha Bayles,"| __truncated__ ...
  .. ..$ print_page      : chr [1:10] "1" "336" "32" "33" ...
  .. ..$ blog            :List of 10
  .. ..$ source          : chr [1:10] "The New York Times" "The New York Times" "The New York Times" "The New York Times" ...
  .. ..$ multimedia      :List of 10
  .. ..$ keywords        :List of 10
  .. ..$ pub_date        : chr [1:10] "2003-03-05T00:00:00Z" "1967-10-22T00:00:00Z" "1998-05-31T00:00:00Z" "2001-01-07T00:00:00Z" ...
  .. ..$ document_type   : chr [1:10] "article" "article" "article" "article" ...
  .. ..$ news_desk       : chr [1:10] "The Arts/Cultural Desk" NA "Arts and Leisure Desk" "Arts and Leisure Desk" ...
  .. ..$ section_name    : chr [1:10] "Arts" NA "Arts" "Arts" ...
  .. ..$ subsection_name : logi [1:10] NA NA NA NA NA NA ...
  .. ..$ type_of_material: chr [1:10] "News" "Article" "News" "News" ...
  .. ..$ _id             : chr [1:10] "4fd22af78eb7c8105d7b61c1" "4fc4631845c1498b0d991921" "4fd2184b8eb7c8105d7946fb" "4fd22d668eb7c8105d7ba7d2" ...
  .. ..$ word_count      : int [1:10] 1800 619 924 3540 477 2547 1746 309 461 152
  .. ..$ headline.main   : chr [1:10] "Garcia's Shadow, but His Own Sound; Jimmy Herring, the Dead's New Member, Helps Reinterpret Old Songs" "Musical Revolution" "A Welcome Truce For Jazz Festivals" "Watching 'Jazz' for Its High Notes and Low" ...
  .. ..$ headline.kicker : chr [1:10] NA NA "POP/JAZZ" "MUSIC" ...
  .. ..$ byline.person   :List of 10
  .. ..$ byline.original : chr [1:10] "By BEN RATLIFF" "By MICHAEL ZWERIN" "By Peter Watrous" NA ...
 $ status   : chr "OK"
 $ copyright: chr "Copyright (c) 2013 The New York Times Company.  All Rights Reserved."
# Get number of hits
print(response.df$response$meta$hits)
[1] 27

Of course, most experiences using Web APIs will require multiple GET requests, each different from the next. To speed this process along, we can create a function that can take several arguments and then automatically generate a properly formated GET request URL. Here, for instance, is one such function we might write:

# Write a function to create get requests
nytapi<-function(search.terms=NULL, begin.date=NULL, end.date=NULL, page=NULL,
                     base.url="http://api.nytimes.com/svc/search/v2/articlesearch",
                     response.format=".json",
                     key="ef9055ba947dd842effe0ecf5e338af9:15:72340235"){

  # Combine parameters
  params<-list(
    c("q", search.terms),
    c("begin_date", begin.date),
    c("end_date", end.date),
    c("page", page)
  )
  params<-params[sapply(X = params, length)>1]
  params<-sapply(X = params, FUN = paste0, collapse="=")
  params<-paste0(params, collapse="&")
  
  # URL encode query portion
  query<-URLencode(URL = params, reserved = FALSE)

  # Combine with base url and other options
  get.request<-paste0(base.url, response.format, "?", query, "&api-key=", key)
  
  # Send GET request
  response<-httr::GET(url = get.request)
  
  # Parse response to JSON
  response<-httr::content(response, "text")  
  response<-jsonlite::fromJSON(txt = response, simplifyDataFrame = T, flatten = T)
  
  return(response)
}

Now that we have our handy NYT API function, let’s try and do some data analysis. To figure out whether Duke Ellington is “trending” over the past few years, we can start by using our handy function to get a count of how often the New York Times mentions the Duke…

# Get number of hits, number of page queries
duke<-nytapi(search.terms = "duke ellington", begin.date = 20050101, end.date = 20150101)
hits<-duke$response$meta$hits
print(hits)
[1] 1059
round(hits/10)
[1] 106

After making a quick call to the API, it appears that we have a total of 1059 hits. Since the API only allows us to download 10 results at a time, we need to make 106 calls!

# Get all articles   
duke.articles<-sapply(X = 0:105, FUN = function(page){
  #cat(page, "")
  response<-tryCatch(expr = {
    r<-nytapi(search.terms = "duke ellington", begin.date = 20050101, end.date = 20150101, page = page)
    r$response$docs
  }, error=function(e) NULL)
  return(response)
})

# Combine list of dataframes
duke.articles<-duke.articles[!sapply(X = duke.articles, FUN = is.null)]
duke.articles<-plyr::rbind.fill(duke.articles)

To figure out how Duke’s popularity is changing over time, all we need to do is add an indicator for the year and month each article was published in, and then use the plyr package to count how many articles appear with each year-month combination:

# Add year-month indicators
duke.articles$year.month<-format(as.Date(duke.articles$pub_date), "%Y-%m")
duke.articles$year.month<-as.Date(paste0(duke.articles$year.month, "-01"))

# Count articles per month
library(plyr)
duke.permonth<-ddply(.data = duke.articles, .variables = "year.month", summarize, count=length(year.month))

# Plot the trend over time
library(ggplot2)
ggplot(data = duke.permonth, aes(x = year.month, y = count))+geom_point()+geom_smooth(se=F)+
  theme_bw()+xlab(label = "Date")+ylab(label = "Article Count")+ggtitle(label = "Coverage of Duke Ellington")
geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

Looks like he actually is getting more popular of late!